For the explore and summarise data project, the propserLoanData data set is what I chose. The data set has 81 columns & 113937 observations.
#converting the ListingCategory to factor variable & associating them
#for labeling purposes
prosperLoanData$ListingCategory..numeric. <-
factor(prosperLoanData$ListingCategory..numeric.)
cats <- c('0 - Not Available', '1 - Debt Consolidation',
'2 - Home Improvement', '3 - Business', '4 - Personal Loan',
'5 - Student Use', '6 - Auto', '7- Other', '8 - Baby&Adoption',
'9 - Boat', '10 - Cosmetic Procedure', '11 - Engagement Ring',
'12 - Green Loans', '13 - Household Expenses',
'14 - Large Purchases', '15 - Medical/Dental', '16 - Motorcycle',
'17 - RV', '18 - Taxes', '19 - Vacation', '20 - Wedding Loans')
#converting Term into a factor variable
prosperLoanData$Term <- factor(prosperLoanData$Term)
Loan Original Amount has a positive skew. The mean value for the LoanOriginalAmount is 8337 which is greater than the median. The median for the LoanOriginalAmount is 6500.
Monthly Loan Payment has a positive skew. The median is less than the mean for this data. The median for the monthly Loan Payment is 217 and the mean is 272.
credit score range lower has a normal distribution. The mean/median in this case is 685.56/680
BorrowerRate also has a normal distribution and the mean/med ratio is 0.19/0.18 However, there are various spikes especially between 0.24 and 0.31.
Debt To Income Ratio exhibits a slight positive skew with mean > median. Mean is 0.275 and median is 0.22. DI ratio is less than 0.5 for a good percentage of pepole.
Stated Monthly Income has a positive skew with mean > median. Mean for the Stated Monthly Income is 5608 and the median is 4666.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 637.5 1061.5 1224.9 1622.4 171004.2 8554
MonthlyDebtTotal has been added as a new column. This was derived by multiplying the debt to Income Ratio to the stated Monthly Income.
Total Monthly debt has a positive skew on its histogram and the mean is greater than the median. The mean is 1224. The median is 1061. The first quartile for this data is 637 and the third quartile is 1622.
The Employment Status Duration exhibits a positive skew. The mean of the data is higher than the median. The mean is 96 and the median is 67. It is observed that more number of employees worked for shorter durations.
Current Credit Lines has a slight positive skew. The mean/median ratio is 10.31/10.
Loans originate mostly during October, December and January
During the year 2009, there was a drop. This can be attributed to the economic crisis during that year.
##
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
Most of the loans are either in the completed status or in current status.
## $0 $1-24,999 $100,000+ $25,000-49,999 $50,000-74,999
## 621 7274 17337 32192 31050
## $75,000-99,999 Not displayed Not employed
## 16916 7741 806
Most of the loans are taken by people with income range between 25k USD to 50k USD and 50k to 75k USD.
## 12 36 60
## 1614 87778 24545
##
## 12 36 60
## 1614 87778 24545
Most of the loans are of duration 36 months (3 years)
Top 10 states by the number of borrowers were plotted and this the state California topped the list.
Since the most of the loans belonged to the category of “Other”, the data is not fully sufficient to analyse people with what kind of occupation go for the highest loans.
## [1] "Borrowers - Are they home owners?"
## False True
## 56459 57478
Almost 50% of the borrowers are home owners
Employment Status data doesnt seem to be complete. Also, as anticipated, the employed categroy go for the highest number of loans.
Tip: Now that you’ve completed your univariate explorations, it’s time to reflect on and summarize what you’ve found. Use the questions below to help you gather your observations and add your own if you have other thoughts!
The data set has around 113937 observations and around 81 columns. For our analysis a combination of discrete and continuous variables are being used. All of the variables are not being used. We are using around 20 variables here.
The data set is huge and the more we explore, the more can we get deeper into the data. For now, the focus is on some of the below areas, which was found interesting and intriguing: credit score, debt to income ratio, income, loan amount, interest rate, term etc
Loan Category, Loan Amount, Loan Status etc are available and these will help analyse the data better
LoanYear, LoanMonth etc were derived from the LoanOriginationDate. Also, a new column MonthlyDebtTotal was added by multiplying DI ratio with the stated Monthly Income
Term and ListingCategory..numeric. were converted to factor variables. They were initially categorical values. It was done while making a bar plot which expected that these values are discrete.
## Warning: Removed 16724 rows containing missing values (geom_point).
## $title
## [1] "Current Credit Lines Vs Total Monthly Debt"
##
## $subtitle
## NULL
##
## attr(,"class")
## [1] "labels"
There is a clear positive correlation between Current CreditLines and the Total Monthly Debt and the value of the correlation coeffient is 0.47.
## Warning: Removed 766 rows containing missing values (geom_point).
There is a clear negative correlation betwen the CreditScores and the borrower rate. The correlation coefficient is -0.46. Persons with a lower credit scores get the loans at a lower borrower rate.
## Warning: Removed 11194 rows containing missing values (geom_point).
There is a positive correlation between the stated Monthly Income and the total monthly debt. The correlation coefficient is 0.36.
The correlation between the loan Amount and Borrower Rate is negative. The value of the correlation coeffient is -0.32. This would mean that the higher loans are disbursed at a lower interest rate.
## Warning: Removed 591 rows containing missing values (geom_point).
There is a slight positive correlation between loan Amount and Credit Scores. Higher the loans, the credit scores can increase.
## Warning: Removed 11165 rows containing missing values (geom_point).
Monthly Income Vs Total Monthly Debt was plotted and was facet wrapped by IsBorrowerHomeowner.The dispersion is higher in the case of home owners and for the non home owners, the dispersion is concentrated more amoung the borrowers with an income of around 5000.
The median amount dips in 2009 and has a sharp rise in 2013.
Dec to Feb is the period when the loan amounts are usually higher.
The higher amount loans are taken for Debt consolidation and baby and adoption
Borrowers with large incomes take large loan.
Employed and Self employed are the ones who are taking higher loans
Higher Loans Amounts are repaid thorough longer terms
Higher the term of repayment does not always means that the interest rates are higher.
Term and Credit Scores - Not much insights can be drawn from this plot.
Total Monthly Debt and Current Credit Lines have a strong positive correlation Longer the term of the loan, it is found that the loan amount is also larger Employed and self employed ones borrow larger amounts
Could not do this analysis. The credit scores and the salary ranges, credit scores and employment status are some of the relationships I wanted to analyse.
CurrentCreditLines and MonthlyDebtTotal had a strong +ve correlation with a correlation coefficient of 0.47. Also, credit scores and borrower rate have a strong positive correlation with a correlation coefficient of 0.46
The top left ones are the ones which takes loans for a smaller amount and at higher rates. These are the borrowers with lower credit scores
MonthlyLoanPayment was divided by the StatedMonthlyIncome to observe the yearwise debt coverage ratio. This was faceted by IncomeRange. All the income ranges have an uptick in the most recent years.
Most of the borrowers seem to have DI ratio of 0.2 to 0.3 and is showing an uptick in the most recent years. Borrowers with income range $100k + have a low DI ratio and is around 0.15-0.20
Borrowers with lower credit scores takes loans for a smaller amount, however at a higher interest rate
Also, it was observed that borrowers with the highest income have a lower percentage of debt.
Taking a look at the DI ratio, it was observed that higher the income, the lower is the percentage of debt.
Tip: You’ve done a lot of exploration and have built up an understanding of the structure of and relationships between the variables in your dataset. Here, you will select three plots from all of your previous exploration to present here as a summary of some of your most interesting findings. Make sure that you have refined your selected plots for good titling, axis labels (with units), and good aesthetic choices (e.g. color, transparency). After each plot, make sure you justify why you chose each plot by describing what it shows.
The above plot shows how the lending platform Prosper has matured over years. Red shows higher credit scores and we can see that by the year 2014, blue dominates,which would also mean that the over all risk exposure has increased. (For better visualisation, only credit scores from the first quartile are plotted here)
As the prosper lending platform matures, probably the risk management also has improved. This could be the reason, the borrower rates are decreasing for larger loans, mainly for those with lower debt to income ratio.
Only deliquent loans were subsetted here. The majority of the delinquencies are clustered in 1-15days past due. Maximum delinqueny is happening for the borrowers having income ranges between $25k-50k and $50-75k.
The data set has almost 114,000 loans.
The most time consuming part of this analysis was to find some variables which will help make some good analysis & to list down what are the analysis that has to be performed.
There was too much of information that, it was not difficult to atleast perform the single variable and bi variable analysis. There were too many choices. However, drilling down into multi variable analysis from the results of the bi variable analysis was again difficult.
Making the choice for plotting three plots as the outcome of the entire analysis was again difficult since there was too much of data to be chosen from.
Regression models & some predictions are what I would have tried working on if time permitted. For example, preclosure, delinquency, late payment etc if it can be predicted, it would have been good.
One of the good decisions was to restrict the analysis to limited variables and not look into the entire data to perform the analysis. This helped finish the analysis comparitively faster.